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Restricted Boltzmann machines are undirected neural networks which have been shown to be effective in 
many applications, including serving as initializations for training deep multi-layer neural networks. One of the 
main reasons for their success is the existence of efficient and practical stochastic algorithms, such as contrastive 
divergence, for unsupervised training. We propose an alternative deterministic iterative procedure based on an 
improved mean field method from statistical physics known as the Thouless-Anderson-Palmer approach. We 
demonstrate that our algorithm provides performance equal to, and sometimes superior to, persistent contrastive 
divergence, while also providing a clear and easy to evaluate objective function. We believe that this strategy 
can be easily generalized to other models as well as to more accurate higher-order approximations, paving the 
way for systematic improvements in training Boltzmann machines with hidden units. 


I. INTRODUCTION 

A restricted Boltzmann machine (RBM) lH El is a type of undirected neural network with surprisingly many applications. 
This model has been used in problems as diverse as dimensionality reduction 0. classification 0, collaborative filtering 0, 
feature learning 0, and topic modeling 0. Also, quite remarkably, it has been shown that generative RBMs can be stacked 
into multi-layer neural networks, forming an initialization for discriminative deep belief nets El- Such deep architectures are 
believed to be crucial for learning high-order representations and concepts. 

While the training procedure for RBMs can be written as a log-likelihood maximization, an exact implementation of this ap¬ 
proach is computationally intractable for all but the smallest models. However, fast stochastic Monte Carlo methods, specifically 
contrastive divergence (CD) j2] and persistent CD (PCD) 0, have made large-scale RBM training both practical and efficient. 
These methods have popularized RBMs even though it is not entirely clear why such approximate methods should work as well 
as they do. 

In this paper, we propose an alternative deterministic strategy for training RBMs, and neural networks with hidden units in 
general, based on the so-called mean field, and extended mean field, methods of statistical mechanics. This strategy has been 
used to train neural networks in a number of earlier works EM!. In fact, for entirely visible networks, the use of adaptive 
cluster expansion mean field methods have lead to spectacular results in learning Boltzmann machine representations DSH21. 

However, unlike these fully visible models, the hidden units of the RBM must be taken into account during the training pro¬ 
cedure. In 2002, Welling and Hinton fl4l presented a similar deterministic mean field learning algorithm for general Boltzmann 
machines with hidden units, considering it a priori as a potentially efficient extension of CD. In 2008, Tieleman 0 tested the 
method in detail for RBMs and found it provided poor performance when compared to both CD and PCD. In the wake of these 
two papers, little inquiry has been made in this direction, with the apparent consensus being that the deterministic mean field 
approach is ineffective for RBM training. 

Our goal is to challenge this consensus by going beyond naive mean field, a mere first-order approximation, by introducing 
second-, and possibly third-, order terms. In principle, it is even possible to extend the approach to arbitrary order. Using this 
extended mean field approximation, commonly known as the Thouless-Anderson-Palmer lfl8l approach in statistical physics, we 
find that RBM training performance is significantly improved over the naive mean field approximation and is even comparable 
to PCD. The clear and easy to evaluate objective function, along with the extensible nature of the approximation, paves the way 
for systematic improvements in learning efficiency. 


II. TRAINING RESTRICTED BOLTZMANN MACHINES 

A restricted Boltzmann machine, which can be viewed as a two layer undirected bipartite neural network, is a specific case of 
an energy based model wherein a layer of visible units are fully connected to a layer of hidden units. Let us denote the binary 
visible and hidden units, indexed by i and j respectively, as v, and hj. The energy of a given state, v = {•(■,}, h = { h :j }, of the 
RBM is given by 


E(v, h) 


i j i,j 


(i) 
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where Wij are the entries of the matrix specifying the weights, or couplings, between the visible and hidden units, and «, and bj 
are the biases, or the external fields in the language of statistical physics, of the visible and hidden units, respectively. Thus, the 
set of parameters {wij, a,i, bj} define the RBM model. 

The joint probability distribution over the visible and hidden units is given by the Gibbs-Boltzmann measure P(v, h) = 
Z~ 1 e~ EI ' v ’ h) , where Z = h e" , ' :iv - h; is the normalization constant known as the partition function in physics. For a given 
data point, represented by v, the marginal of the RBM is calculated as P(v) = ]P h P(v, h). Writing this marginal of v in terms 
of its log-likelihood results in the difference 


C = In P(v) = -P c (v) + P, (2) 

where F = — In Z is the free energy of the RBM, and F c (v) = — ln(]P h f_= E! ' v ' h ' r ) can be interpreted as a free energy as well, 
but with visible units fixed to the training data point v. Hence, F c is referred to as the clamped free energy. 

One of the most important features of the RBM model is that F c can be easily computed as h may be summed out analytically 
since the hidden units are conditionally independent of the visible units, owing to the RBM’s bipartite structure. However, 
calculating F is computationally intractable since the number of possible states to sum over scales combinatorially with the 
number of units in the model. This complexity frustrates the exact computation of the gradients of the log-likelihood needed 
in order to train the RBM parameters via gradient ascent. Monte Carlo methods for RBM training rely on the observation that 
= P(v, = 1, hj = 1), which can be simulated at a lower computational cost. Nevertheless, drawing independent samples 
from the model in order to approximate this derivative is itself computationally expensive and often approximate sampling 
algorithms, such as CD or PCD, are used instead. 


III. EXTENDED MEAN FIELD THEORY OF RBMS 

Here, we present a physics-inspired tractable estimation of the free energy F of the RBM. This approximation is based on a 
high temperature expansion of the free energy derived by Georges and Yedidia in the context of spin glasses |fl9l following the 
pioneering works of II1811201 . We refer the reader to lf2ll for a review of this topic. 

To apply the Georges-Yedidia expansion to the RBM free energy, we start with a general energy based model which possesses 
arbitrary couplings between undifferentiated binary spins .s, € {0,1}, such that the energy of the Gibbs-Boltzmann measure 
on the configuration s = {s^} is defined by E( s) = — JY — JT . WijSjSj. We also restore the role of the temperature, 
usually set to 1 in most energy based models, by multiplying the energy functional in the Boltzmann weight by the inverse 
temperature p. 

Next, we apply a Legendre transform to the free energy, a standard procedure in statistical physics, by first writing the free 
energy as a function of a newly introduced auxiliary external field q = {(?, }, —/LF[q] = In e.-P E 0 s )+PY.i qiSi . This external 
field will be eventually set to the value q = 0 in order to recover the true free energy. The Legendre transform T is then given as 
a function of the conjugate variable m = {m,} by maximizing over q, 

— /3r[m] = -/3max[F[q] = -/3(F[q*[m]] (3) 

i i 

where the maximizing auxiliary field q*[m], a function of the conjugate variables, is the inverse function of m[q] = . 

Since the derivative ^ is exactly equal to — (s), where the operator (•) refers to the average configuration under the Boltzmann 
measure, the conjugate variable m is in fact the equilibrium magnetization vector (s). Finally, we observe that the free energy is 
also the inverse Lengendre transform of its Legendre transform at q = 0, 

— pF = —/3F[ q = 0] = ^min[r[m]] = — /3r[m*], (4) 

m 


where m* minimizes F. 

Following Ifl9ll20l , this formulation allows us to perform a high temperature expansion of A(3, m) = -flljinj around 3 = 0 
at fixed m. 


A(3, m) = A(0, m) + 3 


dA(3, m) 
^P~ 


+ 


3=0 


d 2 A(P, m) 
dp 2 



(5) 


where the dependence on P of the product /iq must carefully be taken into account. At infinite temperature, 3 = 0, the spins 
decorrelate, causing the average value of an arbitrary product of spins to equal the product of their local magnetizations; a useful 
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property. Accounting for binary spins taking values in {0,1}, one obtains the following expansion 

—/?r(m) = — ^2 [rrii In to, + (1 — rrii) ln(l — to,)] + /f airrii + /3 Wy-miTOj 
* » (hi) 



+ /i 3 ^ WijWj k Wki(mi - mf)(mj - m 2 )(m k - m 2 k ) H- (6) 

(hiA) 


The zeroth-order term corresponds to the entropy of non-interacting spins with constrained magnetizations values. Taking this 
expansion up to the first-order term, we recover the standard naive mean field theory. The second-order term is known as the 
Onsager reaction term in the TAP equations II 811 . The higher orders terms are systematic corrections which were first derived 
in d. 

Returning to the RBM notation and truncating the expansion at second-order for the remainder of the theoretical discussion, 
we have 


T(m v , m h ) ~ S^m", m / '') — ^ apn'i — ^ bj 


- 5Z "V">/ + ^( m i - - m f)i 


h3 


(7) 


where S is the entropy contribution, m v and m h are introduced to denote the magnetization of the visible and hidden units, and 
/? is set equal to 1. Eq. ([7]) can be viewed as a weak coupling expansion in Wij. To recover an estimate of the RBM free energy, 
Eq. ([7]i must be minimized with respect to its arguments, as in Eq. <(4]). Lastly, by writing the stationary condition j? r = 0, we 
obtain the self-consistency constraints on the magnetizations. For instance, at second-order we obtain the following constraint 
on the visible magnetizations. 



( 8 ) 


where sigm[x] = (1 + e _x ) _1 is a logistic sigmoid function. A similar constraint must be satisfied for the hidden units, as well. 
Clearly, the stationarity condition for T obtained at order n utilizes terms up to the n th order within the sigmoid argument of 
these consistency relations. Whatever the order of the approximation, the magnetizations are the solutions of a set of non-linear 
coupled equations of the same cardinality as the number of units in the model. Finally, provided we can define a procedure to 
efficiently derive the value of the magnetizations satisfying these constraints, we obtain an extended mean field approximation 
of the free energy which we denote as f* lEMF . 


IV. RBM EVALUATION AND UNSUPERVISED TRAINING WITH EMF 
A. An iteration for calculating F emf 

Recalling the log-likelihood of the RBM, £ = —F c (v) + F, we have shown that a tractable approximation of F, F emf , 
is obtained via a weak coupling expansion so long as one can solve the coupled system of equations over the magnetizations 
shown in Eq. (|8j. In the spirit of iterative belief propagation |2T1 . we propose that these self-consistency relations can serve as 
update rules for the magnetizations within an iterative algorithm. In fact, the convergence of this procedure has been rigorously 
demonstrated in the context of random spin glasses Ell . We expect that these convergence properties will remain present even 
for real data. The iteration over the self-consistency relations for both the hidden and visible magnetizations can be written using 
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the time index t as 


rrij [i + 1] -<— sigm 
to" [£ + !]•<— sigm 


b j + Wi i m i M “ w ij ( m j W “ ( m ?M - ( m i M) 2 ) > 

i ' ' 

«i + 5Z [* + !] - w ij (k’M - ^ K[* + 1] - (”*?[* + l]) 2 ) 


(9, 10) 


where the time indexing follows from careful application of G2). The values of m 1 and m /l minimizing r(m", m , ‘), and thus 
providing the value of /<’ FMF , are obtained by running Eqs. ( |9, 10[ ) until they converge to a fixed point. We note that while we 
present an iteration to find .F emf up to second-order above, third-order terms can easily be introduced into the procedure. 


B. Deterministic EMF training 


By using the EMF estimation of F, and the iterative algorithm detailed in the previous section to calculate it, it is now possible 
to estimate to gradients of the log-likelihood used for unsupervised training of the RBM model by substituting F with F emf . 
We note that the deterministic iteration we propose for estimating F is in stark contrast with the stochastic sampling procedures 
utilized in CD and PCD to the same end. For instance, the gradient ascent update of weight Wij is approximated as 


A oc 


dC 

dw i: j 


dF c <TF emf 

dwij dwij ’ 


( 11 ) 


where dw can be computed by differentiating Eq. <[7J) at fixed m" and m and computing the value of these derivatives at 
the fixed points of Eqs. (|9, 1 0J> obtained from the iterative procedure. The gradients with respect to the visible and hidden biases 

or,EMF o nEMF 

can be derived similarly. Interestingly, i)n — and f)h — are merely the fixed-point magnetizations of the visible and hidden 
units, to" and m F , respectively. 

A priori, the training procedure sketched above can be used at any order of the weak coupling expansion. The training 
algorithm introduced in lfT4l . which was shown to perform poorly for RBM training in (9J, can be recovered by retaining only 
the first-order of the expansion when calculating F emf . By taking F emf to second-order, we expect that training efficiency 
and performance will be greatly improved over M- In fact, including the third-order term in the training algorithm is just as 
easy as including the second-order one, due to the fact that the particular structure of the RBM model does not admit triangles 
in its corresponding factor graphs. Although the third-order term in Eq. © does include a sum over distinct pairs of units, as 
well as a sum over coupled triplets of units, such triplets are excluded by the bipartite structure of the RBM. However, coupled 
quadruplets do contribute to the fourth-order term and therefore fourth- and higher-order approximations require much more 
expensive computations CGD, though it is possible to utilize adaptive procedures as in E6). 


V. NUMERICAL EXPERIMENTS 
A. Experimental framework 

To evaluate the performance of the proposed deterministic EMF RBM training algorithm, we perform a number of numerical 
experiments over two separate datasets and compare these results with both CD-I and PCD. We first use the MNIST dataset of 
labeled handwritten digit images li23il . The dataset is split between 60 000 training images and 10 000 test images. Both subsets 
contain approximately the same fraction of the ten digit classes (0 to 9). Each image is comprised of 28 x 28 pixels taking values 
in the range [0, 255]. The MNIST dataset was binarized by setting all non-zero pixels to 1 in all experiments, with the exception 
of one experiment in which we train on a version of the MNIST dataset rescaled to [0,1]. 

Second, we use the 28 x 28 pixel version of the Caltech 101 Silhouette dataset |[24l . Constructed from the Caltech 101 image 
dataset, the silhouette dataset consists of black regions of the primary foreground scene object on a white background. The 
images are labeled according to the object in the original picture, of which there are 101 unevenly represented object labels. The 
dataset is split between a training (4 100 images), a validation (2 264 images), and a test (2 304 images) sets. 

For both datasets, the RBM models require 784 visible units. Following previous studies evaluating RBMs on these datasets, 
we fix the number of RBM hidden units to 500 in all our experiments. During training, we adopt the mini-batch learning 
procedure for gradient averaging, with 100 training points per batch for MNIST and 256 training points per batch for Caltech 
101 Silhouette. 
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FIG. 1: Estimates of the per-sample log-likelihood over the MNIST test set, normalized by the total number of units, as a function of the 
number of training epochs for a 500 hidden unit RBM. The results for the different training algorithms are plotted in different colors with the 
same color code used for both panels. Left panel : Pseudo log-likelihood estimate. The difference between EMF algorithms and contrastive 
divergence algorithms is minimal. Right panel : EMF log-likelihood estimate at 2 nd order. The improvement from MF to TAP is clear. 
Perhaps reasonably, TAP demonstrates an advantage over CD and PCD. Notice how the second-order EMF approximation of C provides less 
noisy estimates, at a lower computational cost. 


We test the EMF learning algorithm presented in Section [TV B| in various settings. First, we compare implementations utilizing 
the first-order (MF), second-order (TAP2), and third-order (TAP3) approximations of F. Higher orders were not considered due 
to their greater complexity. Next, we investigate training quality when the self-consistency relations on the magnetizations were 
not converged when calculating the derivatives of F emf , but instead iterated for only a small fixed (3) number of iterations, 
an approach similar to CD-I. Furthermore, we also evaluate a “persistent” version of our algorithm, similar to 0. In this 
implementation, the magnetizations of a set of points, dubbed fantasy particles, are updated and maintained throughout the 
training in order to estimate F. This persistent procedure takes advantage of the fact that the RBM-defined Boltzmann measure 
changes only slightly between training epochs. Convergence to the new fixed point magnetizations at each epoch should therefore 
be sped up by initializing with the converged state from the previous update. Our final experiments consist of persistent training 
algorithms using 3 iterations of the magnetization self-consistency relations (P-MF, P-TAP2 and P-TAP3) and one persistent 
training algorithm using 30 iterations (P-TAP2-30) for comparison. 

Lastly, we evaluate RBM training for the rescaled, non-binarized, MNIST dataset (P-TAP2 raw). This experiment is designed 
to mimic the pre-training of a second stacked RBM of a deep belief net. In this setting, the training data for the second RBM 
consists of non-binary magnetizations derived from the hidden units of first RBM operating on the true binary training data 
US- For this experiment, EMF iterations were used to estimate both the clamped term as well as the free energy term in the 
computation of the log-likelihood gradients. 

For comparison, we also train RBM models using CD-I, following the prescriptions of l25l . and PCD, as implemented by 
Tieleman 0. Given that our goal is to compare RBM training approaches rather than achieving the best possible training across 
all free parameters, neither momentum nor adaptive learning rates were included in any of the implementations tested. However, 
we do employ a weight decay regularization in all our trainings to keep weights small; a necessity for the weak coupling 
expansion on which the EMF relies. When comparing learning procedures on the same plot, all free parameters of the training 
(e.g. learning rate, weight decay, etc.) were set identically. All results are presented as averages over 10 independent trainings 
with standard deviations reported as error bars. 


B. Relevance of the EMF log-likelihood 

Our first observation is that the implementations of the EMF training algorithms are not overly belabored. The free parameters 
relevant for the PCD and CD-I procedures were found to be equally well suited for the EMF training algorithms. In fact, as 
shown in the left panel of Fig. [T] and the right inset of Fig. [3] the ascent of the pseudo log-likelihood over training epochs is 
very similar between the EMF training methods and both the CD-I and PCD trainings. 
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FIG. 2: Fantasy particles generated by a 500 hidden unit RBM after 50 epochs of training on the MNIST dataset with PCD (top two rows), 
P-MF (middle two rows) and P-TAP2 (bottom two rows). These fantasy particles represent typical samples generated by the trained RBM 
when used as a generative prior for handwritten numbers. The samples generated by P-TAP2 are of similar subjective quality, and perhaps 
slightly preferable, to those generated by PCD. while certainly preferable to those generated by P-MF. 




FIG. 3: Test set classification accuracy for the MNIST (left) and Caltech 101 Silhouette (right) datasets using logistic regression on the 
hidden unit marginal probabilities as a function of the number of epochs used to train a 500 hidden unit RBM. As a baseline comparison, 
the classification accuracy of logistic regression performed directly on the data is given as a black dashed line. Here, P-TAP2 raw refers to 
the P-TAP2 training algorithm performed on the rescaled, non-binarized, version of the MNIST dataset. The results for the different training 
algorithms are displayed in different colors, with the same color code being used in both panels. (Right inset:) Pseudo log-likelihood over 
training epochs for the Caltech 101 Silhouette dataset. 


Interestingly, for the Caltech 101 Silhouettes dataset, it seems that the persistent algorithms tested have difficulties in ascending 
the pseudo-likelihood in the first epochs of training. This contradicts the common belief that persistence yields more accurate 
approximations of the likelihood gradients. The complexity of the training set, 101 classes unevenly represented over only 4100 
training points, might explain this unexpected behavior. The persistent fantasy particles all converge to similar non-informative 
blurs in the earliest training epochs with many epochs being required to resolve the particles to a distribution of values which are 
informative about the pseudo log-likelihood. 

Examining the fantasy particles also gives an idea of the performance of the RBM as a generative model. In Fig. [2j 24 
randomly chosen fantasy particles from the 50 th epoch of training with PCD, P-MF, and P-TAP2 are displayed. The RBM 
trained with PCD generates recognizable digits, yet the model seems to have trouble generating several digit classes, such as 3, 
8, and 9. The fantasy particles extracted from a P-MF training are of poorer quality, with half of the drawn particles featuring 
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non-identifiable digits. The P-TAP2 algorithm, however, appears to provide qualitative improvements. All digits can be visually 
discerned, with visible defects found only in two of the particles. These particles seem to indicate that it is indeed possible to 
efficiently persistently train an RBM without converging on the fixed point of the magnetizations. 

The relevance of the EMF log-likelihood for RBM training is further confirmed in the right panel of Fig. |T] where we 
observe that both CD-I and PCD ascend the second-order EMF log-likelihood, even though they are not explicitly constructed to 
optimize over this objective. As expected, the persistent TAP2 algorithm with 30 iterations of the magnetizations (P-TAP2-30) 
achieves the best maximization of C EMF . However, P-TAP2, with only 3 iterations of the magnetizations, achieves very similar 
performance, perhaps making it preferable when a faster training algorithm is desired. Moreover, we note that although P-TAP2 
demonstrates improvements with respect to the P-MF, the P-TAP3 does not yield significantly better results than P-TAP2. This 
is perhaps not surprising since the third order term of the EMF expansion consists of a sum over as many terms as the second 
order, but at a smaller order in { w 7J }. 


C. Classification task performance 

We also evaluate these RBM training algorithms from the perspective of supervised classification. An RBM can be interpreted 
as a deterministic function mapping the binary visible unit values to the real-valued hidden unit magnetizations. In this case, 
the hidden unit magnetizations represent the contributions of some learned features. Although no supervised fine-tuning of the 
weights is implemented, we tested the quality of the features learned by the different training algorithms by their usefulness 
in classification tasks. For both datasets, a logistic regression classifier was calibrated with the hidden units magnetizations 
mapped from the labeled training images using the scikit-learn toolbox l26l . We purposely avoid using more sophisticated 
classification algorithms in order to place emphasis on the quality of the RBM training, not the classification method. 

In Fig. [3] we see that the MNIST classification accuracy of the RBMs trained with the P-TAP2 algorithms is roughly equivalent 
with that obtained when using PCD training, while CD-I training yields markedly poorer classification accuracy. The slight 
decrease in performance of CD-I and TAP2 along as the training epochs increase might be emblematic of over-fitting by the 
non-persistent algorithms, although no decrease in the EMF test set log-likelihood was observed. We note that the classification 
accuracy for the RBM trained on the rescaled MNIST data (P-TAP2 raw) is marginally better than the other tested approaches, 
which implies that the training the RBM on real-valued data was successful. Consequently, even if our algorithm is designed for 
binary visible units, it can be used equally with real visible variables by treating them as magnetizations, just as is commonly 
done with the CD-I algorithm. 

Finally, for the Caltech 101 Silhouettes dataset, the classification task, shown in the right panel of Fig. [3] is much more 
difficult a priori. Interestingly, the persistent algorithms do not yield better results on this task. However, we observe that the 
performance of deterministic EMF RBM training is at least comparable with both CD-I and PCD. 


VI. CONCLUSION 

We have presented a method for training RBMs based on an extended mean field approximation. Although a naive mean 
field learning algorithm had already been designed for RBMs, and judged unsatisfactory 0na, we have shown that extending 
beyond the naive mean field to include terms of second-order and above brings significant improvements over the first-order 
approach and allows for practical and efficient deterministic RBM training with performance comparable to the stochastic CD 
and PCD training algorithms. A demo file, with an implementation of our algorithm, is provided online l28l . 

The extended mean field theory also provides an estimate of the RBM log-likelihood which is easy to evaluate and thus 
enables practical monitoring of the progress of unsupervised learning throughout the training epochs. Furthermore, training 
on real-valued magnetizations is theoretically well-founded within the presented approach and was shown to be successful in 
experimentation. These results pave the way for many possible extensions. For instance, it would be quite straightforward to 
apply the same kind of expansion to Gauss-Bernoulli RBMs, as well as to multi-label RBMs. 

The extended mean field approach might also be used to learn stacked RBMs jointly, rather than separately, as is done in 
both deep Boltzmann machine and deep belief network pre-training, a strategy that has shown some promise ll27ll . In fact, the 
approach can be generalized even to non-restricted Boltzmann machines with hidden variables with very little difficulty. Another 
interesting possibility would be to make use of higher-order terms in the series expansion using adaptive cluster methods such as 
those used in Q6). We believe our results show that the extended mean field approach, and in particular the Thouless-Anderson- 
Palmer one, may be a good starting point to theoretically analyze the performance of RBMs and deep belief networks. 
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